Skip to content

Fix async storage retry response drain ordering#49581

Open
arnabnandy7 wants to merge 8 commits into
Azure:mainfrom
arnabnandy7:fix/blob-download-hang
Open

Fix async storage retry response drain ordering#49581
arnabnandy7 wants to merge 8 commits into
Azure:mainfrom
arnabnandy7:fix/blob-download-hang

Conversation

@arnabnandy7

Copy link
Copy Markdown

Description

Fixes #49507.

This PR fixes an async retry ordering issue in RequestRetryPolicy where a retryable storage response body was captured, the response was closed, and only then the body was subscribed to for draining before retrying.

With Reactor Netty, closing the response first can discard the inbound stream and recycle the connection before the drain subscription is attached. In parallel blob downloads, that can cause a retryable chunk response, such as HTTP 500, to hang indefinitely instead of retrying or failing.

The retry path now drains the response body before closing the response, and closes it afterward via doFinally. Responses without a body are still closed immediately before retrying.

A regression test was added to verify that the async retry path subscribes to the retryable response body before closing it.

All SDK Contribution checklist:

  • The pull request does not introduce [breaking changes]
  • [] CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

General Guidelines and Best Practices

  • Title of the pull request is clear and informative.
  • There are a small number of commits, each of which have an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

  • Pull request includes test coverage for the included changes.

Tested with:

mvn -f sdk\storage\azure-storage-common\pom.xml -Dtest=RequestRetryPolicyTest test

Copilot AI review requested due to automatic review settings June 21, 2026 16:38
@github-actions github-actions Bot added Community Contribution Community members are working on the issue customer-reported Issues that are reported by GitHub users external to the Azure organization. Storage Storage Service (Queues, Blobs, Files) labels Jun 21, 2026
@github-actions

Copy link
Copy Markdown
Contributor

Thank you for your contribution @arnabnandy7! We will review the pull request and get back to you soon.

@arnabnandy7

Copy link
Copy Markdown
Author

@microsoft-github-policy-service agree

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes an async retry ordering issue in RequestRetryPolicy to ensure retryable storage responses have their bodies drained (subscribed) before the response is closed, preventing hangs with Reactor Netty connection recycling during parallel downloads.

Changes:

  • Reordered async retry handling to drain the response body before closing the response.
  • Added a regression test validating the drain-vs-close ordering in the async retry path.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
sdk/storage/azure-storage-common/src/main/java/com/azure/storage/common/policy/RequestRetryPolicy.java Adjusts async retry flow to drain the body before closing the response.
sdk/storage/azure-storage-common/src/test/java/com/azure/storage/common/policy/RequestRetryPolicyTest.java Adds an async regression test covering drain-before-close behavior.

arnabnandy7 and others added 2 commits June 21, 2026 22:13
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

@ibrandes ibrandes left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for this fix! the doFinally ordering and Mono.defer are both the right choices here and the test approach of switching body behavior at subscription time cleanly isolates the race. a few things worth addressing before merging:

  • your changes were missing onErrorResume -- without it, a drain error bypasses the retry entirely and surfaces to the caller. adding .onErrorResume(drainError -> Mono.empty()) between doFinally and then closes this gap.
  • another test case should be added for but the skipped-retry interleaving (Flux.error(ISE) after close()) behavior
  • a couple of nits -- the retryResponse alias is unnecessary; FAST_RETRY_OPTIONS with a 1 ms delay makes the StepVerifier timeout more meaningful as a hang detector.

please re-request review when you've addressed the open comments. also, feel free to add an entry for the fix in azure-storage-common\CHANGELOG.md under ### Bugs Fixed (line 9) -- happy to add it myself though if you'd prefer to leave that to us.

@arnabnandy7 arnabnandy7 requested a review from ibrandes June 21, 2026 19:36
@ibrandes

Copy link
Copy Markdown
Member

/azp run java - storage - tests

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Community Contribution Community members are working on the issue customer-reported Issues that are reported by GitHub users external to the Azure organization. Storage Storage Service (Queues, Blobs, Files)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] BlobClient.downloadToFileWithResponse hangs forever after an HTTP 500 on one chunk

3 participants